💓Heart Attack Data Analysis🔎
🩺(Prediction at the end)🔮

What is a heart attack?
A heart attack, also called a myocardial infarction, happens when a part of the heart muscle doesn’t get enough blood.
The more time that passes without treatment to restore blood flow, the greater the damage to the heart muscle.
Coronary artery disease (CAD) is the main cause of heart attack. A less common cause is a severe spasm, or sudden contraction, of a coronary artery that can stop blood flow to the heart muscle.
What are the symptoms of heart attack?
The major symptoms of a heart attack are:
Chest pain or discomfort. Most heart attacks involve discomfort in the center or left side of the chest that lasts for more than a few minutes or that goes away and comes back. The discomfort can feel like uncomfortable pressure, squeezing, fullness, or pain.
Feeling weak, light-headed, or faint. You may also break out into a cold sweat.
Pain or discomfort in the jaw, neck, or back.
Pain or discomfort in one or both arms or shoulders.
Shortness of breath. This often comes along with chest discomfort, but shortness of breath also can happen before chest discomfort.
Exploratory Data Analysis
Aim:
- Understand the data ("A small step forward is better than a big one backwards")
- Begin to develop a modelling strategy
Features
Age : Age of the patient
Sex : Sex of the patient
exang: exercise induced angina (1 = yes; 0 = no)
ca: number of major vessels (0-3)
cp : chest pain type :
- Value 1: typical angina
- Value 2: atypical angina
- Value 3: non-anginal pain
- Value 4: asymptomatic
trtbps : resting blood pressure (in mm Hg)
chol : cholesterol in mg/dl fetched via BMI sensor
fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
rest_ecg : resting electrocardiographic results :
- Value 0: normal
- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
thalach : maximum heart rate achieved
target : 0= less chance of heart attack 1= more chance of heart attack
Base Checklist
Shape Analysis:
- target feature : output
- rows and columns : 303 , 14
- feature types : qualitative : 0 , quantitative : 14
- NaN analysis :
- None (0 % of values are NaN)
Columns Analysis:
- Target Analysis :
- Balanced (Yes/No) : Yes
- Percentages : 55% / 45%
- Categorical values
- There are 8 categorical features (0/1), not including the target
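The checklist above can be computed in one pass. A minimal sketch with a hypothetical `base_checklist` helper, demonstrated on a tiny made-up frame rather than the real dataset:

```python
import pandas as pd

def base_checklist(df, target='output'):
    """Summarise shape, NaN rate, and target balance in one pass."""
    return {
        'rows': df.shape[0],
        'cols': df.shape[1],
        'nan_pct': float(df.isna().mean().mean() * 100),
        'target_balance': df[target].value_counts(normalize=True).round(2).to_dict(),
    }

# Tiny illustrative frame, not the real dataset
demo = pd.DataFrame({'age': [63, 37, 41, 56], 'output': [1, 1, 0, 0]})
print(base_checklist(demo))
```

On the real `df` this would report 303 rows, 14 columns, 0 % NaN and the roughly 55/45 target split listed above.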
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from dataprep.eda import create_report
from dataprep.eda import plot_missing
from dataprep.eda import plot_correlation
from dataprep.eda import plot
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report, roc_curve
from sklearn.model_selection import learning_curve, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler,StandardScaler,MinMaxScaler
import warnings
warnings.filterwarnings('ignore')
Dataset Analysis
data = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
df = data.copy()
pd.set_option('display.max_rows', df.shape[0])
pd.set_option('display.max_columns', df.shape[1])
df.head()
| age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
(df.isna().sum()/df.shape[0]*100).sort_values(ascending=False)
age 0.0 sex 0.0 cp 0.0 trtbps 0.0 chol 0.0 fbs 0.0 restecg 0.0 thalachh 0.0 exng 0.0 oldpeak 0.0 slp 0.0 caa 0.0 thall 0.0 output 0.0 dtype: float64
plot_missing(df)
Missing Statistics
| Missing Cells | 0 |
|---|---|
| Missing Cells (%) | 0.0% |
| Missing Columns | 0 |
| Missing Rows | 0 |
| Avg Missing Cells per Column | 0.0 |
| Avg Missing Cells per Row | 0.0 |
print('There are', df.shape[0], 'rows')
print('There are', df.shape[1], 'columns')
There are 303 rows There are 14 columns
df.duplicated().sum()
1
df.loc[df.duplicated(keep=False),:]
| age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 163 | 38 | 1 | 2 | 138 | 175 | 0 | 1 | 173 | 0 | 0.0 | 2 | 4 | 2 | 1 |
| 164 | 38 | 1 | 2 | 138 | 175 | 0 | 1 | 173 | 0 | 0.0 | 2 | 4 | 2 | 1 |
df.drop_duplicates(keep='first',inplace=True)
df.shape
(302, 14)
Visualising Target and Features
df['output'].value_counts(normalize=True)  # class balance (mildly imbalanced)
1 0.543046 0 0.456954 Name: output, dtype: float64
for col in df.select_dtypes(include=['float64','int64']):
    sns.displot(df[col], kind='kde', height=3)  # displot creates its own figure
    plt.show()
X = df.drop('output',axis=1)
y = df['output']
Detailed Analysis
riskyDF = df[y == 1]
safeDF = df[y == 0]
sns.pairplot(data, height=1.5)  # pairplot manages its own figure
plt.show()
corr = df.corr(method='pearson').abs()
fig = plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap='viridis', vmin=0, vmax=1)  # absolute correlations lie in [0, 1]
plt.title('Absolute Pearson Correlation')
plt.show()
print(df.corr()['output'].abs().sort_values())
fbs 0.026826 chol 0.081437 restecg 0.134874 trtbps 0.146269 age 0.221476 sex 0.283609 thall 0.343101 slp 0.343940 caa 0.408992 thalachh 0.419955 oldpeak 0.429146 cp 0.432080 exng 0.435601 output 1.000000 Name: output, dtype: float64
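Reading off the features with |r| > 0.4 from this list can be automated. A small sketch (the `strong_features` helper and the demo frame are illustrative, not part of the notebook):

```python
import pandas as pd

def strong_features(df, target='output', threshold=0.4):
    """Return feature names whose |Pearson r| with the target exceeds threshold."""
    corr = df.corr()[target].abs().drop(target)
    return corr[corr > threshold].sort_values(ascending=False).index.tolist()

# Synthetic example: 'a' tracks the target, 'b' is noise
demo = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': [5, 1, 4, 2, 6, 3],
    'output': [0, 0, 0, 1, 1, 1],
})
print(strong_features(demo))  # -> ['a']
```

Applied to `df`, this would pick out the exng/cp/oldpeak/thalachh/caa group at the bottom of the list above.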
for col in df.select_dtypes(include=['float64','int64']):
    plt.figure(figsize=(4,4))
    sns.kdeplot(riskyDF[col], label='High Risk')  # distplot is deprecated; kdeplot is the modern equivalent
    sns.kdeplot(safeDF[col], label='Low Risk')
    plt.legend()
    plt.show()
Comments
It looks like we have some very useful features here, with a correlation > 0.4. The following features seem promising for predicting whether a patient will have a heart attack:
- oldpeak
- exng
- cp
- thalachh
We can also notice that slp and oldpeak look correlated, let's find out!
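Before the regression plots, the pairwise coefficient can be checked directly with `Series.corr`. A sketch with stand-in values (in the notebook this would be `df['slp'].corr(df['oldpeak'])`):

```python
import pandas as pd

# Stand-in values; in the notebook this would be df[['slp', 'oldpeak']]
demo = pd.DataFrame({
    'slp':     [0, 0, 1, 1, 2, 2],
    'oldpeak': [3.0, 2.5, 1.8, 1.2, 0.4, 0.0],
})
r = demo['slp'].corr(demo['oldpeak'])  # Pearson by default
print(round(r, 3))  # strongly negative for these stand-in values
```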
for col in X.select_dtypes(include=['float64','int64']):
    sns.lmplot(x='oldpeak', y=col, hue='output', data=df)  # lmplot creates its own figure
A bit of data engineering ...
def encoding(df):
    code = {
        # All columns already hold numeric values, so there is nothing to encode
    }
    for col in df.select_dtypes('object'):
        df.loc[:, col] = df[col].map(code)
    return df

def imputation(df):
    df = df.dropna(axis=0)  # there are no NaN anyway
    return df

def feature_engineering(df):
    useless_columns = []  # keep all the features for now
    df = df.drop(useless_columns, axis=1)
    return df

def preprocessing(df):
    df = encoding(df)
    df = feature_engineering(df)
    df = imputation(df)
    X = df.drop('output', axis=1)
    y = df['output']
    return df, X, y
Comments
We can now treat the categorical features as quantitative ones (note: there are no qualitative features to encode here)
Modelling
df = data.copy()
trainset, testset = train_test_split(df, test_size=0.2, random_state=0)
print(trainset['output'].value_counts())
print(testset['output'].value_counts())
1 131 0 111 Name: output, dtype: int64 1 34 0 27 Name: output, dtype: int64
_, X_train, y_train = preprocessing(trainset)
_, X_test, y_test = preprocessing(testset)
preprocessor = make_pipeline(MinMaxScaler())
PCAPipeline = make_pipeline(StandardScaler(), PCA(n_components=2,random_state=0))
RandomPipeline = make_pipeline(preprocessor,RandomForestClassifier(random_state=0))
AdaPipeline = make_pipeline(preprocessor,AdaBoostClassifier(random_state=0))
SVMPipeline = make_pipeline(preprocessor,SVC(random_state=0,probability=True))
KNNPipeline = make_pipeline(preprocessor,KNeighborsClassifier())
LRPipeline = make_pipeline(preprocessor,LogisticRegression(solver='sag'))
PCA Analysis
PCA_df = pd.DataFrame(PCAPipeline.fit_transform(X), index=X.index)  # reuse X's index so the concat aligns after drop_duplicates
PCA_df = pd.concat([PCA_df, y], axis=1)
PCA_df.head()
| 0 | 1 | output | |
|---|---|---|---|
| 0 | 0.603024 | 2.291914 | 1 |
| 1 | -0.478588 | -0.988416 | 1 |
| 2 | -1.847655 | 0.020559 | 1 |
| 3 | -1.724377 | -0.490040 | 1 |
| 4 | -0.403288 | 0.278693 | 1 |
plt.figure(figsize=(8,8))
sns.scatterplot(x=PCA_df[0], y=PCA_df[1], hue=PCA_df['output'], palette=sns.color_palette("tab10", 2))
plt.show()
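A caveat on reading the 2D scatter: it only shows as much structure as the first two components capture, which `explained_variance_ratio_` quantifies. A sketch on synthetic data (in the notebook this would be read off the fitted `PCAPipeline`):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for X: two highly correlated features plus one noise feature
base = rng.normal(size=(100, 1))
X_demo = np.hstack([base,
                    2 * base + rng.normal(scale=0.1, size=(100, 1)),
                    rng.normal(size=(100, 1))])

pipe = make_pipeline(StandardScaler(), PCA(n_components=2, random_state=0))
pipe.fit(X_demo)
ratios = pipe.named_steps['pca'].explained_variance_ratio_
print(ratios, ratios.sum())  # fraction of variance the 2D projection keeps
```

If the two ratios sum to a small fraction, the scatter plot understates how separable the classes really are.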
Classification problem
dict_of_models = {'RandomForest': RandomPipeline,
'AdaBoost': AdaPipeline,
'SVM': SVMPipeline,
'KNN': KNNPipeline,
'LR': LRPipeline}
def evaluation(model):
    model.fit(X_train, y_train)
    # calculate class probabilities
    y_pred_proba = model.predict_proba(X_test)
    # argmax over the probabilities gives the predicted class
    y_pred = np.argmax(y_pred_proba, axis=1)
    print('Accuracy = ', accuracy_score(y_test, y_pred))
    print('-')
    print(confusion_matrix(y_test, y_pred))
    print('-')
    print(classification_report(y_test, y_pred))
    print('-')
    N, train_score, val_score = learning_curve(model, X_train, y_train,
                                               cv=4, scoring='f1',
                                               train_sizes=np.linspace(0.1, 1, 10))
    plt.figure(figsize=(12, 8))
    plt.plot(N, train_score.mean(axis=1), label='train score')
    plt.plot(N, val_score.mean(axis=1), label='validation score')
    plt.legend()
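A side note on the `np.argmax(predict_proba)` step used in `evaluation`: for these classifiers it matches `model.predict` (mapped through `classes_`), since `predict` picks the highest-probability class. A quick self-contained check on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

pred_direct = clf.predict(X_demo)
# Map argmax positions back through classes_ (here classes_ is [0, 1])
pred_argmax = clf.classes_[np.argmax(clf.predict_proba(X_demo), axis=1)]
print((pred_direct == pred_argmax).all())
```

Using `model.predict` directly would be the simpler spelling, but the argmax form keeps the probabilities around for later use.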
for name, model in dict_of_models.items():
    print('---------------------------------')
    print(name)
    evaluation(model)
---------------------------------
RandomForest
Accuracy = 0.8852459016393442
-
[[24 3]
[ 4 30]]
-
precision recall f1-score support
0 0.86 0.89 0.87 27
1 0.91 0.88 0.90 34
accuracy 0.89 61
macro avg 0.88 0.89 0.88 61
weighted avg 0.89 0.89 0.89 61
-
---------------------------------
AdaBoost
Accuracy = 0.9016393442622951
-
[[25 2]
[ 4 30]]
-
precision recall f1-score support
0 0.86 0.93 0.89 27
1 0.94 0.88 0.91 34
accuracy 0.90 61
macro avg 0.90 0.90 0.90 61
weighted avg 0.90 0.90 0.90 61
-
---------------------------------
SVM
Accuracy = 0.8524590163934426
-
[[21 6]
[ 3 31]]
-
precision recall f1-score support
0 0.88 0.78 0.82 27
1 0.84 0.91 0.87 34
accuracy 0.85 61
macro avg 0.86 0.84 0.85 61
weighted avg 0.85 0.85 0.85 61
-
---------------------------------
KNN
Accuracy = 0.8852459016393442
-
[[22 5]
[ 2 32]]
-
precision recall f1-score support
0 0.92 0.81 0.86 27
1 0.86 0.94 0.90 34
accuracy 0.89 61
macro avg 0.89 0.88 0.88 61
weighted avg 0.89 0.89 0.88 61
-
---------------------------------
LR
Accuracy = 0.8360655737704918
-
[[20 7]
[ 3 31]]
-
precision recall f1-score support
0 0.87 0.74 0.80 27
1 0.82 0.91 0.86 34
accuracy 0.84 61
macro avg 0.84 0.83 0.83 61
weighted avg 0.84 0.84 0.83 61
-
Using AdaBoost
AdaPipeline.fit(X_train, y_train)
y_proba = AdaPipeline.predict_proba(X_test)
y_pred = np.argmax(y_proba,axis=1)
print("Adaboost : ", accuracy_score(y_test, y_pred))
Adaboost : 0.9016393442622951
y_pred_prob = AdaPipeline.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot(fpr,tpr,label='AdaBoost ROC Curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("AdaBoost ROC Curve")
plt.show()
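The ROC curve can be summarised by a single number, the area under it; `sklearn.metrics.roc_auc_score` computes it from the same inputs as `roc_curve`. A sketch with toy labels standing in for `y_test` and the AdaBoost probabilities:

```python
from sklearn.metrics import roc_auc_score

# Toy ground truth and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_score(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ranked correctly -> 0.75
```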
Using KNN
KNNPipeline.fit(X_train, y_train)
y_proba = KNNPipeline.predict_proba(X_test)
y_pred = np.argmax(y_proba,axis=1)
print("KNN : ", accuracy_score(y_test, y_pred))
KNN : 0.8852459016393442
KNN Optimization
err = []
for i in range(1, 40):
    model = make_pipeline(preprocessor, KNeighborsClassifier(n_neighbors=i))
    model.fit(X_train, y_train)
    pred_i = model.predict(X_test)
    err.append(np.mean(pred_i != y_test))
plt.figure(figsize=(10, 8))
plt.plot(range(1, 40), err, color='blue',
         linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=8)
plt.title('Mean Err = f(K)')
plt.xlabel('K')
plt.ylabel('Mean Err')
plt.show()
KNNPipeline = make_pipeline(preprocessor,KNeighborsClassifier(n_neighbors = 7))
KNNPipeline.fit(X_train, y_train)
y_proba = KNNPipeline.predict_proba(X_test)
y_pred = np.argmax(y_proba,axis=1)
print("KNN : ", accuracy_score(y_test, y_pred))
KNN : 0.9016393442622951
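`GridSearchCV` (imported above but never used) automates this manual K sweep, and scores each candidate with cross-validation on the training set rather than the test set, which avoids tuning against the data we report on. A sketch on synthetic data standing in for `X_train`/`y_train`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for (X_train, y_train)
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)

pipe = Pipeline([('scale', MinMaxScaler()), ('knn', KNeighborsClassifier())])
grid = GridSearchCV(pipe, {'knn__n_neighbors': list(range(1, 20, 2))},
                    cv=4, scoring='f1')
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
```

`grid.best_estimator_` is then a fitted pipeline that can replace the hand-tuned `KNNPipeline`.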